Spring 2018
Spatial Analysis begins with an exploration of the data.
Mapping to see its location and distribution
Asking questions of, or querying, your data.
Cleaning & reshaping the data
Applying analysis methods
Mapping analysis results
Repeat as needed
There are two key types of spatial queries: spatial measurement queries (e.g., area, length, distance) and spatial relationship queries (e.g., what is inside, what is nearby).
These types are often combined, e.g. what is the mean property value within 1 km of a landmark?
So far we have explored housing values for the city of San Francisco.
The data set consists of many disaggregated features represented as points.
In this section we will link those points to census tracts using spatial queries and overlay operations.
Let's load the R libraries we will use
library(sp)
library(rgdal)
library(rgeos)
library(tmap)
library(RColorBrewer)
library(ggplot2)
library(ggmap)
## leaflet, ggmap, maptools, ??
## Warning: package 'rgdal' was built under R version 3.4.3
## rgdal: version: 1.2-16, (SVN revision 701)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/proj
## Linking to sp version: 1.2-5
## Warning: package 'rgeos' was built under R version 3.4.2
## rgeos version: 0.3-26, (SVN revision 560)
## GEOS runtime version: 3.6.1-CAPI-1.10.1 r0
## Linking to sp version: 1.2-5
## Polygon checking: TRUE
## Warning: package 'tmap' was built under R version 3.4.3
## Google Maps API Terms of Service: http://developers.google.com/maps/terms.
## Please cite ggmap if you use it: see citation("ggmap") for details.
sp: Classes and methods for spatial data
rgdal: for importing, exporting and transforming spatial data
rgeos: for spatial operations and queries on geometric objects
tmap: for creating static and interactive thematic maps
RColorBrewer: for selecting predefined color palettes
ggmap/ggplot2: for geocoding locations, mapping and plotting
Load the data if it is not already loaded
# setwd()
sfhomes <- read.csv('data/sf_properties_25ksample.csv')
sfhomes15 <- subset(sfhomes, as.numeric(SalesYear) == 2015)
class(sfhomes15)
## [1] "data.frame"
sfhomes15_sp <- sfhomes15
coordinates(sfhomes15_sp) <- c('lon','lat') # ORDER MATTERS!!
proj4string(sfhomes15_sp) <- CRS("+init=epsg:4326")
Use readOGR to load the shapefile sf_pop_by_tracts from the data folder. Then take a look at the data. What type of sp object is it?
sftracts <- readOGR(dsn="./data", layer="sf_pop_by_tracts")
## OGR data source with driver: ESRI Shapefile
## Source: "./data", layer: "sf_pop_by_tracts"
## with 196 features
## It has 10 fields
class(sftracts)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(sftracts)
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
summary(sftracts)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##          min        max
## x -123.01392 -122.32801
## y   37.69274   37.86334
## Is projected: FALSE
## proj4string :
## [+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
##  STATEFP  COUNTYFP   TRACTCE                  AFFGEOID
##  06:196   075:196   010100 :  1   1400000US06075010100:  1
##                     010200 :  1   1400000US06075010200:  1
##                     010300 :  1   1400000US06075010300:  1
##                     010400 :  1   1400000US06075010400:  1
##                     010500 :  1   1400000US06075010500:  1
##                     010600 :  1   1400000US06075010600:  1
##                     (Other):190   (Other)             :190
##          GEOID          NAME      LSAD        ALAND
##  06075010100:  1   101    :  1   CT:196   Min.   :  56564
##  06075010200:  1   102    :  1            1st Qu.: 294618
##  06075010300:  1   103    :  1            Median : 409830
##  06075010400:  1   104    :  1            Mean   : 619651
##  06075010500:  1   105    :  1            3rd Qu.: 660393
##  06075010600:  1   106    :  1            Max.   :6108837
##  (Other)    :190   (Other):190
##      AWATER              pop14
##  Min.   :        0   Min.   :    0
##  1st Qu.:        0   1st Qu.: 3100
##  Median :        0   Median : 4111
##  Mean   :  2082654   Mean   : 4230
##  3rd Qu.:        0   3rd Qu.: 5080
##  Max.   :247501271   Max.   :12391
##
head(sftracts@data)
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID   NAME LSAD
## 0      06      075  010700 1400000US06075010700 06075010700    107   CT
## 1      06      075  012201 1400000US06075012201 06075012201 122.01   CT
## 2      06      075  013102 1400000US06075013102 06075013102 131.02   CT
## 3      06      075  013500 1400000US06075013500 06075013500    135   CT
## 4      06      075  015500 1400000US06075015500 06075015500    155   CT
## 5      06      075  016300 1400000US06075016300 06075016300    163   CT
##    ALAND AWATER pop14
## 0 183170      0  5311
## 1  92048      0  4576
## 2 237886      0  2692
## 3 235184      0  2592
## 4 311339      0  3918
## 5 245867      0  4748
plot(sftracts) # or qtm(sftracts)
You can subset an SPDF just like a DF
Select all features with population > 0
Then replot the map
sftracts <- subset(sftracts, pop14 > 0)
plot(sftracts)
To answer the question "In what tract is each property located?" we need to spatially join sftracts and sfhomes15_sp.
A spatial join associates rows of data in one object with rows in another object based on the spatial relationship between the two objects.
A spatial join is based on the comparison of two sets of geometries in the same coordinate space.
This is called a spatial overlay.
Spatial overlay operations in R are implemented using the over function in the sp and rgeos packages.
Point-polygon overlays use sp::over.
Overlays involving SpatialLines objects, or pairs of SpatialPolygons, require the rgeos package and use gIntersects.
That's likely more detail than you need right now, but the point is that rgeos is the go-to library for vector geometry processing in R.
You can interpret over(x, y) as: for each geometry in x, return the attributes of the first geometry in y that intersects it.
You can interpret over(x, y, returnList=TRUE) as: for each geometry in x, return the attributes of all geometries in y that intersect it.
See ?over for details.
In what tract is each SF property located?
homes_with_tracts <- over(sfhomes15_sp, sftracts)
Did that work? If not, why not?
The over function, like almost all spatial analysis functions, requires that both data sets be spatial objects (they are) with the same coordinate reference system (CRS). Let's investigate
# What is the CRS of the property data?
proj4string(sfhomes15_sp)
# What is the CRS of the census tracts?
proj4string(sftracts)
Both datasets are in a geographic CRS, but the CRSs differ: WGS84 (sfhomes15_sp) and NAD83 (sftracts)
Transform the CRS of sftracts to that of sfhomes15_sp
Then test that they are identical
# Transform the CRS for tracts to be the same as that for sfhomes15_sp
sftracts2 <- spTransform(sftracts, CRS(proj4string(sfhomes15_sp)))
# make sure the CRSs are the same
proj4string(sftracts2) == proj4string(sfhomes15_sp)
## [1] TRUE
Now let's try that overlay operation again
# Now try the overlay operation again
# Identify the tract for each property in sfhomes15_sp
homes_with_tracts <- over(sfhomes15_sp, sftracts2)
What is our output? Does it answer our question?
What type of data object did the over function return?
head(homes_with_tracts) # take a look at the output
class(homes_with_tracts)
nrow(homes_with_tracts)
nrow(sftracts2)
nrow(sfhomes15_sp)
Our output homes_with_tracts is a data frame that contains:
- the row id of each property in sfhomes15_sp
- all of the columns from sftracts2@data, including the census tract id (GEOID)
So we are close to answering our question.
But for the data to be useful, we need to link (or join) the GEOID to the sfhomes15_sp object.
# Keep only the GEOID column
homes_with_tracts <- homes_with_tracts[c("GEOID")]
We can use the base cbind (column bind) command to join the data frame to the SpatialPointsDataFrame.
Using cbind correctly requires the rows of both objects to be in the same order! That holds here because over returns its output in the row order of its first argument.
sfhomes15_sp@data <- cbind(sfhomes15_sp@data, homes_with_tracts) ## NOTE - binding to the @data slot!
# Review and note the change
# head(sfhomes15_sp@data)
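Because cbind pairs rows purely by position, never by a key, row order is everything. A toy base-R illustration (hypothetical data frames, not the workshop data):

```r
# Two data frames whose rows are already in corresponding order
homes  <- data.frame(id = c("h1", "h2", "h3"), totvalue = c(100, 200, 300))
tracts <- data.frame(GEOID = c("t9", "t9", "t4"))  # one tract id per home, same row order

# cbind glues columns together: row i of homes is paired with row i of tracts
combined <- cbind(homes, tracts)
combined$GEOID[combined$id == "h3"]  # "t4"
```

If the rows had been shuffled in one of the two objects, cbind would still succeed silently but attach the wrong tract to each home, which is why a key-based merge is safer when order is not guaranteed.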
Map the data in tmap interactive mode
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(sftracts2) +
  tm_polygons(col="beige") +
tm_shape(sfhomes15_sp) +
  tm_dots(col="red")
Did the over command successfully associate the census tract GEOID with each property?
If yes, you now could link the property data to census demographic data by GEOID.
We just performed a type of spatial overlay query called a point-in-polygon query.
We asked "In what tract is each property located?".
Data linkage via space!
We now know the tract for each property.
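Under the hood, a point-in-polygon query decides, for each point, whether it falls inside a polygon's boundary. A minimal base-R sketch of the standard ray-casting test (a toy illustration of the idea, not the actual sp implementation):

```r
# Ray casting: count boundary crossings of a ray from the point; odd = inside.
# vx, vy are the polygon's vertex coordinates in order.
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in 1:n) {
    if (((vy[i] > py) != (vy[j] > py)) &&
        (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])) {
      inside <- !inside  # each crossing flips the state
    }
    j <- i
  }
  inside
}

# Unit square: (0,0), (1,0), (1,1), (0,1)
point_in_polygon(0.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))  # TRUE
point_in_polygon(1.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))  # FALSE
```

sp::over does this test (plus spatial indexing for speed) for every point-polygon pair, which is why it can hand back the tract attributes for each property.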
Now let's think about this question from the tract perspective.
Let's ask the question: what is the mean property value in each census tract?
Since we joined GEOID to each property, we can use the non-spatial aggregate function to compute the mean totvalue for each GEOID.
mean_totvalue_by_tract <- aggregate(totvalue ~ GEOID, sfhomes15_sp, mean)
# Take a look
head(mean_totvalue_by_tract)
##         GEOID totvalue
## 1 06075010100  1315017
## 2 06075010200  1182728
## 3 06075010300  1739681
## 4 06075010400  1199131
## 5 06075010500  1537299
## 6 06075010600  1657004
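aggregate's formula interface works on any data frame, not just spatial ones; a toy example with made-up values:

```r
# Three properties in two hypothetical tracts
toy <- data.frame(GEOID    = c("t1", "t1", "t2"),
                  totvalue = c(100, 300, 500))

# One row per GEOID, with the mean totvalue of its properties
tract_means <- aggregate(totvalue ~ GEOID, data = toy, FUN = mean)
tract_means  # t1 -> 200, t2 -> 500
```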
Rename the columns so that it is clear the value is the mean for the tract!
colnames(mean_totvalue_by_tract) <- c("GEOID","mean_totvalue")
head(mean_totvalue_by_tract)
##         GEOID mean_totvalue
## 1 06075010100       1315017
## 2 06075010200       1182728
## 3 06075010300       1739681
## 4 06075010400       1199131
## 5 06075010500       1537299
## 6 06075010600       1657004
However, we can't map our data frame of mean totvalue by GEOID, because a plain data frame has no geometry.
We can use sp::merge to join the mean_totvalue_by_tract DF to the sftracts2 SPDF.
We should make sure that
the number of rows in sftracts2 and mean_totvalue_by_tract are the same
they share a column of common values - GEOID
When we join two data objects based on values in a common column, it is called an attribute join.
sp::merge makes this simple for sp objects with @data slots.
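An attribute join matches rows by key value rather than by position; base R's merge shows the idea (toy data, not the workshop objects):

```r
# Tracts and per-tract values, deliberately in different row orders
tracts_df <- data.frame(GEOID = c("t1", "t2"), pop14 = c(5000, 3000))
vals_df   <- data.frame(GEOID = c("t2", "t1"), mean_totvalue = c(900, 700))

# merge matches on GEOID, so the ordering of vals_df doesn't matter
joined <- merge(tracts_df, vals_df, by = "GEOID")
joined$mean_totvalue[joined$GEOID == "t1"]  # 700
```

This key matching is exactly what positional cbind cannot do, and it is why merge is the right tool once row orders may differ.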
sftracts2<- merge(sftracts2, mean_totvalue_by_tract,
by.x="GEOID", by.y="GEOID")
IMPORTANT: merge the DF to the SPDF itself, NOT directly to the @data slot!
## Don't do this:
sftracts2@data <- merge(sftracts2@data, mean_totvalue_by_tract,
by.x="GEOID", by.y="GEOID")
head(sftracts2@data)
##          GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID   NAME LSAD
## 7  06075010700      06      075  010700 1400000US06075010700    107   CT
## 20 06075012201      06      075  012201 1400000US06075012201 122.01   CT
## 36 06075013102      06      075  013102 1400000US06075013102 131.02   CT
## 40 06075013500      06      075  013500 1400000US06075013500    135   CT
## 45 06075015500      06      075  015500 1400000US06075015500    155   CT
## 54 06075016300      06      075  016300 1400000US06075016300    163   CT
##     ALAND AWATER pop14 mean_totvalue
## 7  183170      0  5311     1862440.8
## 20  92048      0  4576      614217.4
## 36 237886      0  2692     1286435.1
## 40 235184      0  2592     1493924.9
## 45 311339      0  3918      873522.1
## 54 245867      0  4748     1446370.5
Create an interactive tmap of census tracts colored by mean_totvalue
tm_shape(sftracts2) + tm_polygons(col="mean_totvalue", style="jenks")
Above we asked "what is the census tract id for each property?"
We then used the non-spatial aggregate function to calculate the mean totvalue for each tract.
Finally, we did an attribute merge to join these results to the census tracts.
However, we can answer the same question from the tract perspective using sp::aggregate.
Compute mean totvalue by census tract
tracts_with_property_count <- aggregate(x = sfhomes15_sp["totvalue"],
by = sftracts2, FUN = mean)
class(tracts_with_property_count)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(tracts_with_property_count@data)
##    totvalue
## 0 1862440.8
## 1  614217.4
## 2 1286435.1
## 3 1493924.9
## 4  873522.1
## 5 1446370.5
plot(tracts_with_property_count)
tracts_with_property_count$GEOID <- sftracts2$GEOID # aggregate preserves the row order of sftracts2
Did both methods get the same result?
Check the mean totvalue for the same GEOID in each SPDF
tracts_with_property_count[tracts_with_property_count$GEOID == "06075010700", ]$totvalue
## [1] 1862441
sftracts2[sftracts2$GEOID == "06075010700",]$mean_totvalue
## [1] 1862441
Many methods of spatial analysis are based on distance queries.
For example, point pattern analysis considers the distance between features to determine whether or not they are clustered.
We can also use distance as a way to select features spatially.
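Distance between lon/lat points is measured on the sphere, not in degrees. A base-R haversine sketch of that idea (an illustration only; the workshop instead transforms to a projected CRS, where distance is simply in meters):

```r
# Great-circle distance in meters between two lon/lat points,
# using the haversine formula and a mean Earth radius of 6,371 km.
haversine <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# One degree of latitude is about 111.2 km
haversine(0, 0, 0, 1)
```

A function like this could select all points within some radius of a landmark; buffering in a projected CRS, as done below, achieves the same thing with ready-made tools.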
Let's compute the mean property value within 1 km of Coit Tower
First, we need to know where Coit Tower is. How can we find that out?
library(ggmap)
library(ggplot2)
coit_tower <- geocode('Coit Tower, San Francisco, CA')
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Coit%20Tower%2C%20San%20Francisco%2C%20CA
coit_tower_pt <- c(coit_tower$lon, coit_tower$lat) # -122.40582, 37.80239
coit_tower_pt
## [1] -122.40582 37.80239
In order to select properties within 1 km of Coit Tower, we create a 1 km radius buffer polygon around the Coit Tower point.
We then do a point-in-polygon operation to either count the number of properties within the buffer or compute the mean totvalue.
rgeos is another powerful and widely used library for working with geospatial data.
It is the muscle for:
- creating new geometries from existing ones
- calculating spatial metrics like area, length, and distance
- calculating the spatial relationship between two geometries
We can use the rgeos::gBuffer function to create our buffer polygon
coordinates(coit_tower) <- c("lon","lat")
proj4string(coit_tower)<- CRS("+init=epsg:4326")
coit_tower_utm <- spTransform(coit_tower, CRS("+init=epsg:26910")) # UTM zone 10N, units in meters
library(rgeos)
coit_km_buffer <- gBuffer(coit_tower_utm, width=1000)
tm_shape(sfhomes15_sp) +
  tm_dots(col="blue") +
tm_shape(coit_km_buffer) +
  tm_borders(col="red", lwd=2) +
tm_shape(coit_tower_utm) +
  tm_dots(col="red")
Now that we have our buffer polygon, how can we compute the mean totvalue of properties within the buffer?
# This first attempt fails: the buffer is in UTM but the homes are in lon/lat
buff_mean <- aggregate(x = sfhomes15_sp["totvalue"],
                       by = coit_km_buffer, FUN = mean)
# Transform the buffer to the CRS of the homes, then try again
coit_buffer_lonlat <- spTransform(coit_km_buffer,
                                  CRS(proj4string(sfhomes15_sp)))
buff_mean <- aggregate(x = sfhomes15_sp["totvalue"],
by = coit_buffer_lonlat, FUN = mean)
What is the mean property value within 1 km of Coit Tower in 2015?
View(buff_mean)
That was a whirlwind tour of just some of the methods of spatial analysis.
There was a lot we didn't and can't cover.
Raster data is another major topic! The raster package is the key there.
library(knitr)
purl("r-geospatial-workshop-f2017-pt2.Rmd", output = "scripts/r-geospatial-workshop-f2017-pt2.r", documentation = 1)